R Markdown Tutorial
This is an introduction to data visualization (in R).
We will be using ggplot2 to plot our data. Let’s make sure you have that installed.
## Loading required package: ggplot2
## Loading required package: tidyr
## Loading required package: forcats
We will use the data from the Baker 2016 paper, again.
# Load in our data:
baker2016 = read.table('baker2016_artificial.txt', header = T)| job | familiar_reproducibility | significance_crisis | proportion_reproducible |
|---|---|---|---|
| PhD Student | Fairly familiar | There is a slight crisis of reproducibility | 70% |
| Student | Very familiar | There is a significant crisis of reproducibility | 50% |
| PhD Student | Fairly familiar | There is a significant crisis of reproducibility | 60% |
| Student | Very familiar | There is a slight crisis of reproducibility | 20% |
| PhD Student | Very familiar | There is a slight crisis of reproducibility | 90% |
| PhD Student | Very familiar | There is a slight crisis of reproducibility | 80% |
Basic Plot Types
The basic structure of a gpplot looks as follows. You call ggplot, pass over some data, specify aesthetics using aes() and detmine the geometry using one (or several) of the following (or many more) options:
geom_point()geom_bar()geom_boxplot()geom_col()geom_hist()geom_density()
Let’s create our first plot of the data. We are plotting a barplot of the data’s column familiar_reproducibility which (non surprisingly) shows us how familiary respondents were with the term ‘Open Science’.
ggplot(baker2016, aes(x= flagged)) +
geom_bar()Now imagine we wanted to use that same plot structure to plot some other variable, i.e the respondents research field. To do so, we’d just replace x=familiar_reproducibility with x=field. Easy. Remember, you can check what other columns are available by running colnames(baker2016) on the command line. Try out a few.
Let’s try out a different plot type. Histograms can be a great way to plot data distributions. Let’s plot participants’ responses to the question ‘What percentage of the research in your field do you think is reproducible?’. The column you are looking for is called proportion_reproducible, so we would plot the histogram as follows:
ggplot(baker2016, aes(x = proportion_reproducible)) +
geom_histogram(stat="count")What is your message?
The participants in the study came from different fields. We may be interested in seeing how the responses differed between the fields. GGplot allows you to use color and grouping options in plots. I.e. we may create a stacked histogramm by telling ggplot to fill the bars by field, specifying fill=field. In other plots you may choose the option color=field - color usually refers to the outlines of plotted items, whereas fill well… “fills” them. Feel free to five this a try on the histogramm to see the difference.
ggplot(baker2016, aes(group= field, fill=field, x = proportion_reproducible)) +
geom_histogram(stat='count')Remember we wanted to get an idea of how responses differed between the fields. What are we learning from the previous plot? Probably, not much. So while a histogramm may have helped us understand that most participants appear to have a great amount of trust in the results in their fields, a stacked histogramm might not be the plot that allows us to understand how results differ between fields.
A good way to disect differences between groups in your data is to use facet_wrap to create different panels for each group. Simply add facet_wrap(facets = baker2016$field) to get one histogram per group. Hopefully, this helps us see how the distributions differ between the fields.
ggplot(baker2016, aes(group= field, fill=field, x = proportion_reproducible)) +
geom_histogram(stat = 'density') +
facet_wrap(facets = baker2016$field)Fitting data
The original Baker 2016 data unfortunately did not include continuous variables. I have created artificial data for us to practice plotting continous variables. dummy_dat2 is an artificial response, and dummy_age is an age vector I created based on the age bins from the original data.
Let’s plot the effects of dummy_age on dummy_dat2.
ggplot(baker2016, aes(x=dummy_age, y=dummy_dat2)) +
geom_point()That looks like a decent correlation. Let’s fit a linear model to this, regressing dummy_age on dummy_dat2. We can fit models to our data with geom_smooth(). Model options are linear (lm), GLM (glm), smoothing splines (gam), and loess regression (loess). We are using a simple linear model.
ggplot(baker2016, aes(x=dummy_age, y=dummy_dat2, color=as.factor(dummy_group))) +
geom_point()+
geom_smooth(method = 'lm')Lastly, I have also created an artificial variable dummy_group. You can analyse group effects on your model by simply specifying the color option with your grouping variable. This will then fit two separate linear regressions. In this case we learn that the grouping variable does not appear to have an effect on our data:
ggplot(baker2016, aes(x=dummy_age, y=dummy_dat2, color=as.factor(dummy_group))) +
geom_point()+
geom_smooth(method = 'lm')Statistics in figures - or how not to fool your audience (and yourself)
Choosing how to plot your data is crucial to communicate your message. As a first step though you may often use plots to get an idea of what your data looks like. In both cases, make sure to choose a representation that enables you to understand your data well.
Here, I am creating an artificial dataset to illustrate some points on how (not to) plot your data.
response1 = rnorm(n = 1000, mean = 10, sd = c(1,2))
response2 = rnorm(n = 1000, mean = c(2,3), sd = 1)
response3 = rnorm(n = 1000, mean = c(4,5), sd = c(1,1.5))
group = rep(letters[1:2], length.out = 100)
df = data.frame(group = as.factor(group), response1=response1, response2=response2, response3=response3)Let’s have a look at response1. You may think it’s a good idea to first plot this data as a barplot, displaying the mean value for response1 in groups a and b. This is what such a plot would look like:
ggplot(df, aes(x=group,y=response1, color=group, fill=group, alpha=0.2)) +
geom_bar(stat='summary', fun=mean) Now that’s not very helpful. It looks as if both groups are the same.
Now you may have noticed that when I created the data, I intentionally created both groups to have the same mean value. However, their standard deviations are very different!
The following plots display the exact same data - but now in a more useful format.
par(mfrow=c(2,1))
ggplot(df, aes(x=group,y=response1, color=group, fill=group, alpha=0.2)) +
geom_violin()
ggplot(df, aes(x=group,y=response1, color=group, fill=group, alpha=0.2)) +
geom_boxplot()Consider having a play around with response2 and response3 to see what kind of plots hide or highlight the differences between those variables.
Wide and Long Format
The data as it is right now is in the so-called wide format. For many types of plots that maynot be useful.
Data in the wide format displays repeated responses in a single row, and each response is in a separate column. Let’s have a look at the question: ‘Have you ever tried to replicate a result, either your own or someone elses, and failed?’. Subjects can say yes or no to both the idea that they have tried to replicate someone else’s results and failed, coded in try_failOwn, and that they have tried to replicate their own work and failed, coded in try_failElse. In the wide format each subject is a row, and each responses is coded as their own column:
| responseid | try_failOwn | try_failElse | field |
|---|---|---|---|
| 112 | Yes | Yes | Biology |
| 777 | Yes | Yes | Biology |
| 1818 | No | No | Biology |
| 1843 | Yes | No | Biology |
| 2007 | Yes | Yes | Biology |
| 2257 | Yes | No | Biology |
An alternative representation would be to have a column indicating which part of the question subjects repsonded to, i.e. own or someone else’s, with each subject being represented twice: One row encodeing their response to the first, the second row encoding their response to the second question.
tmp = gather(baker2016[,c(1,20,21,27)], question, response, 2:3)
knitr::kable(
head(tmp[order(tmp$responseid),]), # This is the table we will plot
booktabs = TRUE)| responseid | field | question | response | |
|---|---|---|---|---|
| 1546 | 24 | Other | try_failOwn | Yes |
| 3122 | 24 | Other | try_failElse | Yes |
| 662 | 27 | Other | try_failOwn | No |
| 2238 | 27 | Other | try_failElse | No |
| 1297 | 36 | Other | try_failOwn | No |
| 2873 | 36 | Other | try_failElse | Yes |
In order to get our data into this format, we make use of the gather function from the tidyr package. This functions takes in the arguments
data, i.e. the data in wide format to be transformed into long formatkey, the variable name of a new column which will enumerate the different categories possiblevalue, the variable name of a new column which will contain the actual data- and a selection of columns which will be converted into the entries in your new column.
data_wide = baker2016[,c(20,21,27)] # This is the data we will be using for our plot in wide format
data_long = gather(data = data_wide, key = question, value = response, 1:2) # Convert the data into long formatThe data in wide format allows us to plot something like this for example:
# Delete responses where participants couldn't remember the result
data_long = data_long[-which((data_long$response == "I can't remember")),]
ggplot(data_long[data_long$question=='try_failElse',], aes(x = field, fill=response)) +
geom_bar(aes(y = ..count../tapply(..count.., ..x.. ,sum)[..x..]), position = 'dodge',group = 1) +
scale_fill_manual(values = c("red", "orange"))+
coord_flip()Prettier plots
You can change a whole range of aspects about a theme. A comprehensive list of options can be found here. Here are a few basic things to get started with:
Color
You can can change the colors of your plot manually, or by using a colormap. In order to manually determine the colors used, add scale_fill_manual(value = c('color1','color2',...)).
R also supplies a range of predefined colormaps you can use. Have a look on page 4 of this cheatsheet for a decent overview.
ggplot(data_long[data_long$question=='try_failElse',], aes(x = field, fill=response)) +
geom_bar(aes(y = ..count../tapply(..count.., ..x.. ,sum)[..x..]), position = 'dodge',group = 1) +
scale_fill_manual(values = c("red", "orange"))+
coord_flip()Instead of manually defining the colors above, we could also have used a predefined discrete color scale, by adding scale_fill_brewer(palette = 'Your brewer color scale').
ggplot(data_long[data_long$question=='try_failElse',], aes(x = field, fill=response)) +
geom_bar(aes(y = ..count../tapply(..count.., ..x.. ,sum)[..x..]), position = 'dodge',group = 1) +
scale_fill_brewer(palette = 'Set3')+
coord_flip()Lastly, let’s consider using color to highlight. You can add a conditional argument to the color or fill options to highlight certain aspects of your plot. I.e. we may want to highlight the participants who responded saying they weren’t familiar enough with the term Open Science - for example in order to convince the department to run additional workshops on Open Science.
In order to highlight this aspect, we add the conditional argument ifelse(flagged=='Not enough',1,2) to our coloroption. Wrapping it in as.factor() simply makes is a discrete variable, which would allow us to choose a discrete colormap.
ggplot(baker2016, aes(x= flagged, fill = as.factor(ifelse(flagged=='Not enough',1,2)))) +
geom_bar()+
xlab('Response')+
ylab('Count')+
ggtitle("How familiar are you with the term 'Open Science'")Reordering factor
In the previous plots, the order of bars wasn’t necessarily helpful at making the graph more easily understandable. Sometimes small things like reordering bars can make the figure more visually appealing and help direct the viewer’s attention to your message.
When factor variables are plotted, the ordering is done by the factor levels. In our case that’s the variable flagged. You can check what your factor levels are using levels(baker2016$flagged). Currently the order is "A reasonable amount","I am unsure","Not enough","Too much".
You could choose to manually reorder your factor levels, or simply make use of the fct_infreq() function from the forcats package, which will sort your factor levels by frequency.
ggplot(baker2016, aes(x= fct_infreq(flagged), fill = as.factor(ifelse(flagged=='Not enough',1,2)))) +
geom_bar()Point shapes
You can edit the shapes of your scatter points using a shape from this table by adding shape to geom_point().
ggplot(baker2016, aes(x=dummy_age, y=dummy_dat2)) +
geom_point(shape=8)+
geom_smooth(method = 'lm')Alpha
A super simple way to make your plots look softer can be to add transparency to them. Going back to our scatter plot from earlier, this may actually help to direct attention to important feature of the graph like your regression line and make overlapping data points more distinguishable. Simply add the argument alpha.
ggplot(baker2016, aes(x=dummy_age, y=dummy_dat2, alpha = 0.2, color=as.factor(dummy_group))) +
geom_point()+
geom_smooth(method = 'lm')Axis labels
Add a title using ggtitle('Your title goes here.') and change the axis lables using xlab('X axis title') and ylab('X axis title').
ggplot(baker2016, aes(x= fct_infreq(flagged), fill = as.factor(ifelse(flagged=='Not enough',1,2)))) +
geom_bar()+
xlab('Response')+
ylab('Count')+
ggtitle("How familiar are you with the term 'Open Science'")Change the theme:
Ggplot has built in themes that you can make use of. These include: theme_bw(), theme_classic(), and theme_void(). Try out a few and see the effects on this plot:
ggplot(baker2016, aes(x= fct_infreq(flagged), fill = as.factor(ifelse(flagged=='Not enough',1,2)))) +
geom_bar()+
xlab('Response')+
ylab('Count')+
ggtitle("How familiar are you with the term 'Open Science'")+
theme_bw()Edit legend
By default, ggplot uses the factor names from your data in the legend, and the respective variable name as the title.
There are a variety of ways you can edit the legend. Here, we will use labs().
Since our legend refers to the fill color of the plot, you will have to reference fill when editing it. You may not want to have a legent title at all, in which case you could add: labs(fill=''). Or maybe you would like to rename it to Questionnaire Response using labs(fill='Questionnaire Response')? Alternatively, you could also decide the legend isn’t adding much value here and leave it out entirely by adding legend.position = "none".
ggplot(baker2016, aes(x= fct_infreq(flagged), fill = as.factor(ifelse(flagged=='Not enough',1,2)))) +
geom_bar()+
xlab('Response')+
ylab('Count')+
ggtitle("How familiar are you with the term 'Open Science'")+
theme_bw() +
labs(fill='') Save plots
Now that we have created lovely plots, we may want to save them as files in order to share and publish them. An easy way to do so is to click on Export in the Plots panel.
More conveniently maybe, we can add a bit of code for the plots to automatically autput to a location on our computer. By wrapping your plot in pdf('/path/to/your/file.pdf', width, height) ... dev.off(), png('/path/to/your/file.png', width, height) ... dev.off(), or jpeg('/path/to/your/file.jpeg', width, height) ... dev.off() you can save your plot to different file formats. Note that weirdly width and height are in inches for pdf(), whereas for jpeg() and png() they are in pixels. By default the plot should save to your current working directory, so you should change the path to whereever you would love to output the file to.
pdf('tmp.pdf',4,5)
ggplot(baker2016, aes(x= fct_infreq(flagged), fill = as.factor(ifelse(flagged=='Not enough',1,2)))) +
geom_bar()+
xlab('Response')+
ylab('Count')+
ggtitle("How familiar are you with the term 'Open Science'")+
theme_bw() +
theme(legend.position = "none")
dev.off()## quartz_off_screen
## 2
Further Resources
This website could be considered the bible of plotting data in R. This ggplot cheatsheet is great to have at hand.